Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning

Authors

  • Yichen Chen
  • Mengdi Wang
Abstract

We study the online estimation of the optimal policy of a Markov decision process (MDP). We propose a class of Stochastic Primal-Dual (SPD) methods that exploit the inherent minimax duality of the Bellman equations. The SPD methods update a few coordinates of the value and policy estimates each time a new state transition is observed. These methods use little storage and have low computational complexity per iteration. The SPD methods find an absolute-$\epsilon$-optimal policy, with high probability, using $\mathcal{O}\left(\frac{|\mathcal{S}|^4 |\mathcal{A}|^2 \sigma^2}{(1-\gamma)^6 \epsilon^2}\right)$ iterations/samples for the infinite-horizon discounted-reward MDP and $\mathcal{O}\left(\frac{|\mathcal{S}|^4 |\mathcal{A}|^2 H^6 \sigma^2}{\epsilon^2}\right)$ iterations/samples for the finite-horizon MDP.
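
For orientation, the minimax duality of the Bellman equations that the abstract refers to can be sketched through the standard linear-programming view of a discounted MDP. The notation here is an assumption for illustration and is not taken from this page: $v$ is the value vector, $\lambda_a \ge 0$ is the dual variable for action $a$, $P_a$ and $r_a$ are the transition matrix and reward vector under action $a$, and $\xi$ is an arbitrary positive weight vector.

```latex
% Primal LP: the optimal value function is the smallest v that
% dominates the Bellman backup for every action a.
\min_{v} \; \xi^{\top} v
\quad \text{s.t.} \quad v \;\ge\; r_a + \gamma P_a v
\quad \text{for all } a \in \mathcal{A}.

% Saddle-point (minimax) form obtained by attaching a multiplier
% \lambda_a \ge 0 to each constraint.
\min_{v} \; \max_{\lambda \ge 0} \; L(v, \lambda)
\;=\; \xi^{\top} v
\;+\; \sum_{a \in \mathcal{A}} \lambda_a^{\top}
\bigl( r_a + \gamma P_a v - v \bigr).
```

Under this reading, a stochastic primal-dual scheme would take noisy gradient steps on $v$ and $\lambda$ using one sampled transition at a time, which is consistent with the abstract's description of updating only a few coordinates per observation.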


Similar resources

Proximal Gradient Temporal Difference Learning Algorithms

In this paper, we describe proximal gradient temporal difference learning, which provides a principled way for designing and analyzing true stochastic gradient temporal difference learning algorithms. We show how gradient TD (GTD) reinforcement learning methods can be formally derived, not with respect to their original objective functions as previously attempted, but rather with respect to pri...

Primal-Dual π Learning: Sample Complexity and Sublinear Run Time for Ergodic Markov Decision Problems

Consider the problem of approximating the optimal policy of a Markov decision process (MDP) by sampling state transitions. In contrast to existing reinforcement learning methods that are based on successive approximations to the nonlinear Bellman equation, we propose a Primal-Dual π Learning method in light of the linear duality between the value and policy. The π learning method is model-free ...

Finite-Sample Analysis of Proximal Gradient TD Algorithms

In this paper, we show for the first time how gradient TD (GTD) reinforcement learning methods can be formally derived as true stochastic gradient algorithms, not with respect to their original objective functions as previously attempted, but rather using derived primal-dual saddle-point objective functions. We then conduct a saddle-point error analysis to obtain finite-sample bounds on their p...

Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning

Constrained Markov Decision Process (CMDP) is a natural framework for reinforcement learning tasks with safety constraints, where agents learn a policy that maximizes the long-term reward while satisfying the constraints on the long-term cost. A canonical approach for solving CMDPs is the primal-dual method which updates parameters in primal and dual spaces in turn. Existing methods for CMDPs o...

Stochastic Variance Reduction Methods for Policy Evaluation

Policy evaluation is a crucial step in many reinforcement-learning procedures, which estimates a value function that predicts states’ long-term value under a given policy. In this paper, we focus on policy evaluation with linear function approximation over a fixed dataset. We first transform the empirical policy evaluation problem into a (quadratic) convex-concave saddle point problem, and then ...

Journal:
  • CoRR

Volume: abs/1612.02516

Publication date: 2016